(54) Text compression options generation

(57) A text processor processes text in a message. The text processor generates a plurality of
compressed forms of components of the message. The processor performs a linguistic analysis on the
body of text to obtain a linguistic output indicative of linguistic components of the body of text.
The processor then generates the plurality of compressed forms that can be used to compress the body
of text. The plurality of compressed forms are generated based on the linguistic output. The
invention can be implemented as a method of generating the compressed forms and as an apparatus.
Description



BACKGROUND OF THE INVENTION 



[0001] The present invention deals with messaging on devices with limited display space. More specifically, the
present invention deals with compressing text, in a linguistically intelligent manner, such that it can be more easily
displayed on small screens.
[0002] Messaging is widely available on current computer systems. Messages can be sent through voice mail,
electronic mail (email), paging, and from other sources or means. Further, the messages from a variety of sources can be
integrated and forwarded to a single device. For example, a user who is currently receiving messages at a computer
or computer network through voice mail and electronic mail may forward those messages to a cellular phone equipped
to receive such messages. However, the screen of a cellular phone has quite limited display space. This can present
significant problems when trying to display messages.

[0003] For example, even very short electronic mail messages, or transcribed voice mail messages, can present
text which is too voluminous to be viewed on a single screen of a cellular phone. This often requires the user to either
decipher an entire message from the first few words of the message (since that is all that can be displayed), or to scroll
down through many lines of text in order to read the entire message. Both approaches are cumbersome and can lead
to errors.

[0004] While text compression has conventionally been used in many different contexts, the purpose of such
compression has primarily been to enable efficient data storage of text. Such compression techniques are completely
inapplicable to contexts in which the compressed text must be deciphered by humans.

SUMMARY OF THE INVENTION 

[0005] A text processor processes text in a message. The text processor generates a plurality of compressed forms
of components of the message. The processor performs a linguistic analysis on the body of text to obtain a linguistic
output indicative of linguistic components of the body of text. The processor then generates the plurality of compressed
forms that can be used to compress the body of text. The plurality of compressed forms are generated based on the
linguistic output. The invention can be implemented as a method of generating the compressed forms and as an
apparatus.



[0006] Another aspect of the invention includes a data structure generated based on the linguistic analysis of the
text. The data structure includes a plurality of fields that contain attributes indicative of the plurality of compressed
forms of portions of the body of text. The data structure can also include a compression type field indicative of a type
of compression used to generate at least one of the attributes contained in the fields of the data structure.

BRIEF DESCRIPTION OF THE DRAWINGS 



FIG. 1 is a block diagram of an embodiment in which the present invention may be used.
FIG. 2 is a block diagram of a message handler for performing linguistic analysis in accordance with one
embodiment of the present invention.
FIG. 3 is a diagram of a portion of a syntax parse tree for an exemplary sentence.
FIG. 4 is a flow diagram of the overall operation of the system shown in FIG. 2.
FIGS. 5A and 5B are more detailed flow diagrams illustrating the operation of the system shown in FIG. 2 in
generating compression options for terminal nodes (or words and punctuation) in a syntactic analysis.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS 

[0008] FIG. 1 illustrates an example of a suitable computing system environment 100 on which the invention may
be implemented. The computing system environment 100 is only one example of a suitable computing environment
and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the
computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination
of components illustrated in the exemplary operating environment 100.

[0009] The invention is operational with numerous other general purpose or special purpose computing system
environments or configurations. Examples of well known computing systems, environments, and/or configurations that
may be suitable for use with the invention include, but are not limited to, personal computers, server computers,
hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable
consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that
include any of the above systems or devices, and the like.

[0010] The invention may be described in the general context of computer-executable instructions, such as program
modules, being executed by a computer. Generally, program modules include routines, programs, objects, components,
data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also
be practiced in distributed computing environments where tasks are performed by remote processing devices that are
linked through a communications network. In a distributed computing environment, program modules may be located
in both local and remote computer storage media including memory storage devices.

[0011] With reference to FIG. 1, an exemplary system for implementing the invention includes a general purpose
computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a
processing unit 120, a system memory 130, and a system bus 121 that couples various system components including
the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures
including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus
architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus,
Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local
bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.

[0012] Computer 110 typically includes a variety of computer readable media. Computer readable media can be any
available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable
and non-removable media. By way of example, and not limitation, computer readable media may comprise computer
storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable
and non-removable media implemented in any method or technology for storage of information such as computer
readable instructions, data structures, program modules or other data. Computer storage media includes, but is not
limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD)
or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage
devices, or any other medium which can be used to store the desired information and which can be accessed by
computer 100. Communication media typically embodies computer readable instructions, data structures, program
modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes
any information delivery media. The term "modulated data signal" means a signal that has one or more of its
characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation,
communication media includes wired media such as a wired network or direct-wired connection, and wireless media
such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included
within the scope of computer readable media.

[0013] The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory
such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133
(BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as
during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are
immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not
limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program
data 137.

[0014] The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage
media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable,
nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic
disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as
a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that
can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash
memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk
drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface
140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a
removable memory interface, such as interface 150.

[0015] The drives and their associated computer storage media discussed above and illustrated in FIG. 1 provide
storage of computer readable instructions, data structures, program modules and other data for the computer 110. In
FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other
program modules 146, and program data 147. Note that these components can either be the same as or different from
operating system 134, application programs 135, other program modules 136, and program data 137. Operating system
144, application programs 145, other program modules 146, and program data 147 are given different numbers here
to illustrate that, at a minimum, they are different copies.

[0016] A user may enter commands and information into the computer 110 through input devices such as a keyboard
162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not
shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often
connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be
connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A
monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video
interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers
197 and printer 196, which may be connected through an output peripheral interface 190.

[0017] The computer 110 may operate in a networked environment using logical connections to one or more remote
computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held
device, a server, a router, a network PC, a peer device or other common network node, and typically includes many
or all of the elements described above relative to the computer 110. The logical connections depicted in FIG. 1 include
a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such
networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
[0018] When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a
network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes
a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem
172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or
other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110,
or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG.
1 illustrates remote application programs 185 as residing on remote computer 180. It will be appreciated that the
network connections shown are exemplary and other means of establishing a communications link between the
computers may be used.

[0019] It should be noted that the present invention can be carried out on a computer system such as that described
with respect to FIG. 1. However, the present invention can be carried out on a server, a computer devoted to message
handling, or on a distributed system in which different portions of the present invention are carried out on different parts
of the distributed computing system.

[0020] FIG. 2 is a block diagram of one illustrative embodiment of a number of components that can be used to
implement the present invention. FIG. 2 includes a message handler 200, a compressor 202 and a target device 204.
Message handler 200 illustratively includes a message parser 204, linguistic analyzer 206 and text compression
component 208. In one illustrative embodiment, target device 204 is a cellular phone or other small screen device which
is connected to compressor 202 through link 210. Link 210 can be a global computer network that may or may not
include radio transmission portions, or any other suitable link for transmitting messages to target device 204.
[0021] Message handler 200 illustratively receives message 212. Message 212 can be from one of a variety of
sources, including a paging system, electronic mail, voice mail, etc. Message 212 thus illustratively includes a variety
of parts including a header, a body of text, and, in the case of email, previous messages in the email thread. Parser
204 parses message 212 into its various parts. The operation of parser 204 is irrelevant to the present invention. All
that is relevant is that a message body 214, or other textual body to be compressed, is identified and provided to
analyzer 206. This can be done in any known way and does not form part of the present invention. Therefore, parser
204 will not be described in detail. Suffice it to say that parser 204 may remove header information and possibly previous
mail messages, and provide the message body 214 to linguistic analyzer 206.
[0022] Of course, it should be noted that parser 204 may provide any other natural language body of text to analyzer
206, other than message body 214. For example, the body of text may be a subject header, a task description header,
a web page, etc. The present discussion proceeds with respect to message body 214 as but one example of text to
be analyzed.

[0023] Linguistic analyzer 206 illustratively includes a lexical analyzer, a morphological analyzer, and a syntax
analyzer. The lexical analyzer receives message body 214 and breaks it into words (or other tokens). This is done in a
known manner. The morphological analyzer accesses a morphological data base (such as a dictionary) and obtains
a variety of information associated with each word (or token), such as the meaning, the part-of-speech, etc. The
syntactic analyzer performs a syntactic analysis of the message body 214 to obtain a syntactic parse tree (or syntactic
analysis structure) for each sentence in the message body and outputs that structure as the output of linguistic analyzer
206. This is also done in a known manner and is briefly illustrated with respect to FIG. 3.
[0024] Text compression component 208 accesses the linguistic analysis output by linguistic analyzer 206 and
generates a plurality of different optional compressions of the components of message body 214. In one illustrative
embodiment, text compression component 208 provides five attributes for each word or phrase in message body 214.
Generally, each of the attributes represents a more aggressive compression of each word under analysis. In one
illustrative embodiment, the data structure output by text compression component 208 includes the following attributes:

ShortType which designates one type of compression rules being applied;
LongForm which is the form of the word as written in message body 214;
ShortForm which is the form of the word after applying the compression rules or techniques identified by the
ShortType attribute;
CaseNormalizedForm which capitalizes the first letter in the ShortForm and provides the remaining letters in lower
case; and
CompressedForm, which is a compressed form of the CaseNormalizedForm and subjects the CaseNormalizedForm
to additional compression rules in an effort to further compress the word.
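
As a rough illustration of the five attributes listed above, the following Python sketch models one record per token. It is a minimal sketch under stated assumptions, not the described implementation: the class name, the field names, the "Noun" ShortType value used in the usage note, and the vowel-dropping rule standing in for the "additional compression rules" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompressionOptions:
    """One record per token, mirroring the five attributes (names are illustrative)."""
    short_type: str                      # ShortType: which rule family was applied
    long_form: str                       # LongForm: the token as written in the text
    short_form: Optional[str]            # ShortForm: result of the ShortType rules
    case_normalized_form: Optional[str]  # CaseNormalizedForm
    compressed_form: Optional[str]       # CompressedForm


def case_normalize(short_form: str) -> str:
    """Capitalize the first letter and lower-case the remaining letters."""
    return short_form[:1].upper() + short_form[1:].lower()


def drop_interior_vowels(word: str) -> str:
    """Assumed extra rule: keep the first character, drop every later vowel."""
    return word[:1] + "".join(c for c in word[1:] if c.lower() not in "aeiou")


def build_options(short_type: str, long_form: str, short_form: str) -> CompressionOptions:
    """Fill the progressively more aggressive forms for one token."""
    normalized = case_normalize(short_form)
    return CompressionOptions(
        short_type, long_form, short_form, normalized, drop_interior_vowels(normalized)
    )
```

For instance, `build_options("Noun", "meeting", "meeting")` would carry "Meeting" as the case-normalized form and "Mtng" as the most aggressive form, leaving a downstream component free to pick whichever fits the display.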

[0025] In one illustrative embodiment, the data structure including these attributes is output as a compressed XML
output 216 and is provided to the compressor component 202. Compressor component 202 may illustratively choose
one of the compressed forms in the compressed output 216 and provide it to target device 204. Compressor component
202 may illustratively choose the compressed form based on the screen space available on target device 204, or other
criteria. It should be noted that compressor component 202 does not form part of the present invention.
[0026] FIG. 3 is one illustrative embodiment of a sentence which may reside in a message body 214. The sentence
reads "You have a meeting with Dr. John Epstein next Tuesday at ten a.m." Of course, message body 214 is provided
to the lexical analyzer which breaks the message body into sentences and into individual words (or tokens). The
morphological analyzer then performs a look up of each word (or token) and identifies part-of-speech and other possible
information desired for analysis. Therefore, it can be seen that the words are identified with the parts-of-speech as
follows:



you = pronoun 
have = verb 
a = article 
meeting = noun 

with = preposition
Dr. John Epstein = proper noun
next = adjective 
Tuesday = noun 
at = preposition; and 
ten a.m. = noun. 

[0027] The syntactic analyzer analyzes the sentence and parts-of-speech into a syntax parse tree, in one illustrative
embodiment, as indicated in FIG. 3. The terminal nodes (or leaf nodes) in the syntax parse tree represent the words
in the sentence, while the non-terminal nodes represent phrases or other upper level syntactic units identifying portions
of the sentence. In the syntax parse tree illustrated in FIG. 3, the designation "S" represents a sentence node, while
the designation "NP" represents a noun phrase, "VP" represents a verb phrase, and "PP" represents a prepositional
phrase. The triangles above "next Tuesday" and "at ten a.m." simply indicate that those phrases can be further analyzed
into nodes which have been eliminated for the sake of simplicity. The syntax parse tree indicates that the sentence is
formed of a noun phrase, followed by a verb phrase, followed by two other syntactic components which are not
specifically analyzed herein.

[0028] Text compression component 208 illustratively compresses the sentence shown in FIG. 3, in a linguistically
intelligent manner, such that it can be deciphered by a human. In performing such compression, a number of problems
present themselves. For example, it may be intuitive to delete all of certain types of words in the text. For instance, it
may be intuitive to delete all articles in the text. However, while this may work in English, it does not work in other
languages. In fact, it does not even work in all of the Romance languages. Take, for example, the French phrase "Je le
lui ai fait manger", which is translated as "I made him eat it." It should be noted that the clitic pronoun "le" looks exactly
like the definite masculine article "le" (which is translated as "the"). Therefore, if all "articles" or words "the" and their
equivalents in the different languages were removed, this would drastically change the meaning of some phrases in
different languages.
[0029] Similarly, it may seem intuitively reasonable to remove all spaces in the text. However, where electronic mail
aliases or uniform resource locators (URLs) are provided in the message, removing the spaces would make it very
difficult to tell where the email aliases or URLs reside within the text. Many such symbol sensitive text fragments are
used in messages today. If case or symbols are changed in the fragment, the entire fragment irretrievably loses its
meaning. Take, for example, the phrase "Visit http://microsoft.com for information". If this were reduced to
"visithttp://microsoft.comforinfo" it is very difficult to determine where the URL ends within the text fragment.
[0030] Therefore, the present invention does not take such an unintelligent and uniform approach. Instead, the
present invention bases its compression on the linguistic analysis performed by analyzer 206.
[0031] FIG. 4 is a flow diagram which illustrates in a bit greater detail the operation of message handler 200. First,
message handler 200 receives message 212. This is indicated by block 218. Parser 204 locates the message body in
message 212 and passes message body 214 to analyzer 206. This is indicated by block 220. Analyzer 206 breaks the
message body 214 into sentences. This is indicated by block 222. The lexical analyzer component of analyzer 206 then
performs a lexical analysis of the text body to break the sentences into tokens such as words, numbers and punctuation
symbols. Tokens can also consist of more than a single word, such as multi-word expressions like "along with" or "by
means of". This is indicated by block 224. The morphological analyzer in linguistic analyzer 206 then performs its
morphological analysis and thus locates parts-of-speech, and other relevant information corresponding to each token.
This is indicated by block 226. The syntactic analyzer then performs a syntactic analysis and provides, in one illustrative
embodiment, a syntax parse tree. This is indicated by block 228.
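
The tokenization step, in which a multi-word expression is kept as one token, can be sketched as follows. This is only an assumption-laden illustration: the regular expression, the `MULTI_WORD` tuple (holding just the two expressions named in the text) and the `tokenize` function are hypothetical, and the text describes the actual lexical analysis only as being done in a known manner.

```python
import re

# The two multi-word expressions named in the text; a real lexicon would be larger.
MULTI_WORD = ("along with", "by means of")


def tokenize(sentence: str) -> list:
    """Split a sentence into tokens, keeping listed multi-word expressions whole."""
    # Try the longest expressions first so a shorter one never pre-empts a longer one.
    expressions = "|".join(
        re.escape(m) for m in sorted(MULTI_WORD, key=len, reverse=True)
    )
    # Alternatives, in order: multi-word expression, plain word or number, punctuation.
    pattern = rf"\b(?:{expressions})\b|\w+|[^\w\s]"
    return re.findall(pattern, sentence, flags=re.IGNORECASE)
```

Calling `tokenize("Come along with me, by means of the bus.")` yields "along with" and "by means of" as single tokens, with the comma and the final period as separate punctuation tokens.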

[0032] Text compression component 208 then iteratively examines each of the nodes in the analysis provided by
analyzer 206 to determine whether potential compression options are available. This is indicated by block 230. Once
the nodes in the analysis have been examined, and the various compression options have been identified, the
compression options are output, as, for example, an XML output 216. This is indicated by block 232. Compressor 202 then
simply chooses one of the options for each word (or token) and provides the message in compressed form to target
device 204.

[0033] FIGS. 5A and 5B illustrate in better detail the operation of text compression component 208 in generating the
potential compression options for the analyzed portions of message body 214. FIGS. 5A and 5B specifically illustrate
the operation of text compression component 208 in generating possible compression options for terminal nodes (or
leaf nodes) in the analysis output by analyzer 206. In other words, FIGS. 5A and 5B illustrate the treatment of each
word (or token) in the text message for potential compression, as opposed to non-terminal nodes which may represent
phrases or larger fragments of the message body.

[0034] First, the long form of each token is received. Recall that the long form is the form of the token which is written
in the text body. This is indicated by block 234 in FIG. 5A. The long form is saved as an attribute that is output in the
data structure provided as the compressed output 216. This is indicated by block 236.
[0035] Next, the ShortType attribute is determined and saved. Recall that the ShortType attribute is an attribute that
indicates the specific type of compression rules applied to the long form of the token. This is indicated by block 238.
The various ShortType attributes in accordance with one embodiment of the present invention are discussed at greater
length below.

[0036] It is then determined whether, using the compression rules identified by the ShortType attribute, the entire
node under analysis is to be deleted. For example, some nodes are to be deleted under all circumstances. Articles
(which have a ShortType attribute "Articles") in the English language can always be omitted. Such articles include a,
the, those, and these, for example. Greetings have ShortType attribute "Greeting" and are also specially handled in
block 240. Greetings (such as Dear Bob, Hi, and Hi Bob) can all be deleted. Determining whether the node is to be
deleted under all circumstances is indicated by block 240. If so, then as indicated in block 238, the ShortType attribute
is set to "Articles" (or whatever is appropriate) and the ShortForm, the CaseNormalizedForm, and the
CompressedForm attributes are all set to a null value. This is indicated by block 242.
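
The always-delete check can be sketched as below. This is a minimal sketch, assuming a plain dictionary per token; the `ARTICLES` set holds only the four examples given in the text, and the greeting pattern is a hypothetical stand-in for whatever the linguistic analysis actually reports.

```python
import re

ARTICLES = {"a", "the", "those", "these"}   # the examples named in the text
GREETING = re.compile(r"(dear|hi)\b")        # assumed pattern, for illustration only


def deletion_type(token: str):
    """Return the ShortType for a token that is always deleted, else None."""
    lowered = token.lower().rstrip(".,!")
    if lowered in ARTICLES:
        return "Articles"
    if GREETING.match(lowered):
        return "Greeting"
    return None


def deleted_options(token: str):
    """Build the attribute record for a deleted node; None if the node is kept."""
    short_type = deletion_type(token)
    if short_type is None:
        return None
    return {
        "ShortType": short_type,
        "LongForm": token,              # the long form is always saved
        "ShortForm": None,              # the remaining forms are set to null
        "CaseNormalizedForm": None,
        "CompressedForm": None,
    }
```

A token such as "meeting" returns `None` here and would go on to the special-handling checks instead.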

[0037] If, at block 240, it is determined that the node is not to be deleted in its entirety, it is determined whether any
other special handling for this node is to be undertaken. This is indicated by block 244. Such special handling can take
a wide variety of forms. A number of those forms will now be discussed.

[0038] A group of adjectives (having the ShortType "Adjective") are specially handled. Those include words which
begin with "wh", such as which, who and what. Those adjectives are discussed in greater detail below.

[0039] English articles were discussed above with respect to block 240. English articles can be omitted under all
circumstances. However, articles in other languages may need special handling. For example, German definite articles
can be omitted under all circumstances. However, indefinite articles are retained because of ambiguity (since the same
form can mean "a" or "one"). Spanish and French definite articles are deleted, but clitic pronouns with the same spelling
are not. Indefinite articles in Spanish and French are retained because of ambiguity (since the same form can mean
"a" or "one").

[0040] Adverbs have the ShortType attribute "Adverbs" and those that are classified as "wh" words (why, how, when,
etc.) are not compressed in any fashion, and are dealt with below. Other adverbs undergo character reduction (such
as vowel deletion, consonant deletion or both) which is also discussed in greater detail below.
[0041] Company names have ShortType attribute "Company" and are also specially handled. The company type is
deleted. For example, "Microsoft Corporation" can be converted to simply "Microsoft". The shortened form is subject
to character reduction and case normalization as discussed below.

[0042] Conjunctions have the ShortType attribute "Conjs" and are specially handled as well. For example, the English
conjunction "and", the French "et" and the German "und" are replaced with the ampersand sign. The Spanish "y/e" is
not reduced since it is already one letter. All other conjunctions are left as is, and are subjected to the later processing
steps.
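
The conjunction rule reduces to a small per-language lookup, sketched below. The language codes and the `AMPERSAND_WORDS` table are assumptions made for illustration; only the three conjunctions named in the text are included.

```python
# Conjunctions replaced by "&", keyed by an assumed two-letter language code.
AMPERSAND_WORDS = {"en": {"and"}, "fr": {"et"}, "de": {"und"}}


def shorten_conjunction(word: str, lang: str) -> str:
    """Return "&" for the listed conjunctions; leave every other conjunction as is."""
    if word.lower() in AMPERSAND_WORDS.get(lang, set()):
        return "&"
    return word   # e.g. Spanish "y"/"e" is already one letter and is not reduced
```

Conjunctions that pass through unchanged, such as English "or", would still be subject to the later processing steps.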

[0043] A number of different types of nouns are specially handled as well. Absolute dates and times are designated
with the ShortType "Dates" and are treated in the following way. In all languages, for a month in isolation, the long
month name is converted to a short form. Short month names with periods at the end have the period removed. Vowel
compression, case normalization, etc. are not performed on the resulting short form. For example, in the phrase "lets
meet in November" November is reduced to "Nov". Similarly, the phrase "lets meet in Nov." has the November
abbreviation converted to "Nov" (i.e., the trailing period is stripped).

[0044] In all languages, a month (and year) with no day of the month designated is rendered as a short month name
alone. For example, the term "November 2001", where "2001" is the present year, is simply reduced to "Nov".
[0045] If the date is a month plus a year that is not the current year, it is converted to a numeric month plus a separator
plus a numeric year. For example, "Nov 2002" is converted to "11/2002" (for the English and French languages) or
"11.2002" (for other European languages).

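The month and month-plus-year rules can be sketched as follows. This is an illustrative sketch only: it handles English month names, takes the current year as an explicit argument, and uses assumed language codes to pick the separator; the function and table names are hypothetical.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
# Map the first three letters of each English month name to its number.
MONTH_INDEX = {name[:3].lower(): i for i, name in enumerate(MONTHS, start=1)}


def shorten_month(text: str, current_year: int, lang: str = "en") -> str:
    """Shorten "November", "Nov.", "November 2001" or "Nov 2002"."""
    parts = text.replace(".", " ").replace(",", " ").split()
    number = MONTH_INDEX[parts[0][:3].lower()]
    short_name = MONTHS[number - 1][:3]
    # A month alone, or a month in the current year, becomes the short name.
    if len(parts) == 1 or int(parts[1]) == current_year:
        return short_name
    # Otherwise: numeric month, separator, numeric year.
    separator = "/" if lang in ("en", "fr") else "."
    return f"{number}{separator}{parts[1]}"
```

With 2001 as the current year, "November 2001" shortens to "Nov", while "Nov 2002" becomes "11/2002" for English and "11.2002" when `lang` is, say, "de".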
[0046] Similarly, in the American English language, single absolute dates are normalized to month/day/year
numerical format. Dates in other languages are normalized to their formats (e.g., Japanese always uses the year-month-day
format). In English and French the forward slash mark is used as the separator while in Spanish and German the period
is used as the separator.

[0047] The year is omitted if it is equal to the year of "today" or if the year plus 2000 is equal to the year of "today".
For example, 23 July, 2001 is converted to 7/23. In addition, Monday 23 July is converted to 7/23.
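
The year-omission rule for American English dates can be sketched as below, assuming the date has already been parsed into numeric month, day and (optional) year. The function name and the choice of 1 December 2001 as "today" in the usage note are assumptions for illustration.

```python
from datetime import date
from typing import Optional


def shorten_us_date(month: int, day: int, year: Optional[int], today: date) -> str:
    """Render month/day, appending the year only when it differs from today's year."""
    # The year is dropped when absent, equal to today's year, or a two-digit
    # year that equals today's year once 2000 is added.
    if year is None or year == today.year or year + 2000 == today.year:
        return f"{month}/{day}"
    return f"{month}/{day}/{year}"
```

With `today = date(2001, 12, 1)`, both `shorten_us_date(7, 23, 2001, today)` and `shorten_us_date(7, 23, None, today)` give "7/23", matching the two examples in the text.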

[0048] Similarly, midnight receives special handling as well. Midnight is also designated by the ShortType "Dates"
and its short form is "12am". The common collocation "12 midnight" also has the short form "12am", a special case to
avoid the output "12 12am".

[0049] Date ranges in the English language are also subject to special handling. For example, the term "December
5th-9th" is converted to "12/5-9". Also, the date range "December 5th - 9th, 2002" is converted to "12/5-9/2002".
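
A date range of this shape can be recognized with a single pattern, as in the sketch below. The regular expression and function name are assumptions; the sketch covers only English ranges within one month and returns any other input unchanged.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# month name, start day, hyphen, end day, optional ", year"
RANGE = re.compile(
    r"(?P<month>[A-Za-z]+)\s+(?P<start>\d{1,2})(?:st|nd|rd|th)?\s*-\s*"
    r"(?P<end>\d{1,2})(?:st|nd|rd|th)?(?:,\s*(?P<year>\d{4}))?"
)


def shorten_date_range(text: str) -> str:
    """Convert e.g. "December 5th-9th" to "12/5-9"; return other text unchanged."""
    match = RANGE.fullmatch(text.strip())
    if match is None or match["month"].capitalize() not in MONTHS:
        return text
    month = MONTHS.index(match["month"].capitalize()) + 1
    short = f"{month}/{match['start']}-{match['end']}"
    if match["year"]:
        short += f"/{match['year']}"
    return short
```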

[0050] Offset dates are also treated specially and are given the ShortType "OffsetDate". In the event that a term such
as "next Wednesday" is identified in the text, the date on which the message is sent (or authored) is obtained and the
offset date "next Wednesday" is resolved. Therefore, if the message was sent on Friday, December 1st, the reference
to "next Wednesday" would be December 6th. The term "next Wednesday" would thus be converted to "12/6".
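
Resolving an offset date against the send date can be sketched with the standard `datetime` module. Two assumptions are made here: "next <weekday>" is read as the nearest following occurrence of that weekday (which matches the example), and the usage note picks the year 2000, in which December 1st fell on a Friday; the text does not state the year.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_next_weekday(name: str, sent: date) -> date:
    """Return the first date strictly after `sent` that falls on weekday `name`."""
    target = WEEKDAYS.index(name.lower())
    days_ahead = (target - sent.weekday() - 1) % 7 + 1   # always 1..7
    return sent + timedelta(days=days_ahead)


def shorten_offset_date(phrase: str, sent: date) -> str:
    """Convert a phrase such as "next Wednesday" to numeric month/day."""
    resolved = resolve_next_weekday(phrase.split()[-1], sent)
    return f"{resolved.month}/{resolved.day}"
```

For a message sent on `date(2000, 12, 1)`, `shorten_offset_date("next Wednesday", ...)` resolves to December 6th and returns "12/6".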

[0051] The days of the week are given the ShortType "Days". In all languages, isolated days of the week that cannot
be reliably resolved to absolute dates are converted to the short forms of those days. Short day names with periods
at the end have the periods stripped therefrom. Vowel compression, case normalization, etc. are not performed on the
resulting short form. For example, in the phrase "lets meet on Monday", the term "Monday" is converted to "Mon".
[0052] Electronic mail aliases and URLs are also subject to special handling. Electronic mail aliases and URLs are
maintained, intact, without case normalization or removal of vowels. Emails are given the ShortType "Email" and URLs
are given the ShortType "URL".

[0053] Phone numbers are given the ShortType "Phone" and have punctuation removed from the interior thereof.
For example, the phone number in the term "call me at (425) 703-7371" is simply converted to "4257037371".

[0054] States and countries are given the ShortType "Geo" and are replaced with their conventional abbreviations.
For example, "Washington" is replaced by "WA", "Alabama" is replaced by "AL", etc.

[0055] Non-language items are given the ShortType "NotLanguage" and linguistic compression is not performed.
Examples of such items include: 



[0056] Spelled out numbers are also subject to special handling and are given the ShortType "Number". Spelled out
numbers are replaced with Arabic numerals. For example, the English phrase "one thousand four hundred twenty-five"
is replaced by "1425". Separators are illustratively not used between thousands.
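One way the spelled-out number replacement of paragraph [0056] might look for English is shown here. It is a small hand-written parser given only as a sketch; the vocabulary tables and the function name are assumptions, and phrasings such as "a hundred" are not handled.

```python
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}


def words_to_number(phrase):
    """Convert an English spelled-out number to digits with no separators."""
    total = 0    # completed groups (thousands, millions, ...)
    current = 0  # group currently being built, always below 1000
    for word in phrase.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current *= 100
        elif word in SCALES:
            total += current * SCALES[word]
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return str(total + current)


print(words_to_number("one thousand four hundred twenty-five"))
```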

[0057] Denominations of money are also subject to special handling and are provided with the ShortType "Dollars".
The term "K" is substituted for thousands. The term "M" is substituted for millions and "B" is substituted for billions. For
example, $100,000 is converted to $100K, $123,000,000 is converted to $123M, and $2,000,000,000 is converted to
$2B. Also, these short forms are not subject to case normalization, which will be described below.
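A short sketch of the K/M/B substitution for dollar amounts. The function name is hypothetical, and keeping any remainder as a decimal fraction (so that $2,250,000,000 becomes $2.25B) is how this sketch treats amounts that are not round.

```python
from decimal import Decimal

# Largest unit first so that billions win over millions and thousands.
SUFFIXES = [(10**9, "B"), (10**6, "M"), (10**3, "K")]


def shorten_dollars(text):
    """Abbreviate a dollar amount such as '$100,000' using K, M or B."""
    amount = int(text.lstrip("$").replace(",", ""))
    for unit, suffix in SUFFIXES:
        if amount >= unit:
            # Division by a power of ten is exact in Decimal;
            # normalize() drops trailing zeros.
            scaled = (Decimal(amount) / Decimal(unit)).normalize()
            return f"${scaled:f}{suffix}"
    return f"${amount}"


for sample in ["$100,000", "$123,000,000", "$2,000,000,000"]:
    print(sample, "->", shorten_dollars(sample))
```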

[0058] Similarly, in one illustrative embodiment, fractions are indicated as well. For example, $2,250,000,000 is
converted to $2.25B. Also, numerical amounts which are followed by a currency designator are normalized to the common
symbol for the currency along with the number. For example, "one hundred dollars" is converted to "$100". The term
"57 pounds" is converted to "£57". "500 Francs" is converted to "500Fr", etc.
Examples of non-language items (see paragraph [0055]):

X = x + y;
If (x=1){
< Some XML > Content < /Some XML > <Foo/>.

[0059] Proper names are subject to special handling and are given the ShortType "PrprN". In languages other than
German, multi-part proper names are condensed down to just the first family name, if possible. For example, "Dr. Mary
Smith" is converted to "Smith".

[0060] It should be noted that Spanish phrasal last names are condensed to the first part (e.g., "Cardoso
de Campos" is reduced to "Cardoso"). Also, in one illustrative embodiment, vowel removal is not conducted on proper
names.

[0061] Similarly, proper names are subjected to dictionary lookup for more common given names. For example, the
proper name "Patrick" may be replaced by "Pat". The name "William" may be replaced by "Will", etc. Further, if a given
name and a final initial are provided, this is reduced just to the first name.

[0062] In the German language, proper names are more troublesome, because the language capitalizes many words
in text fragments. Therefore, proper names are not compressed when they are preceded by determiners in the German
language.

[0063] Possessives are also specially handled and are given the ShortType "Possessive". In the English language,
possessives with the "'s" and "s'" clitics can be rewritten without the apostrophe. For example, the term "John's house"
can be written as "Johns house". Similarly, "the dogs' tails" can be written as "dogs tails".

[0064] A number of prepositions are subject to special handling as well and are given the ShortType "Preps". For
example, in the English language, some prepositions are summarized through a lookup table. For instance, "through"
can be summarized as "thru". The word "at" can be summarized with "@". The terms "to" and "for" can also be
summarized as the numbers "2" and "4" in certain circumstances. They are only summarized in this way if they are not
adjacent to a numeral or a number spelled out in full that has a possible numeral substitution. For example, in the
phrase "I want to leave", the term "to" is replaced by the number "2". However, in the phrase "I have been to two good
movies lately" the term "to" is not changed to the number "2" since this would result in a possible misconstrual that the
speaker had been to twenty-two good movies.
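The adjacency check for "to" and "for" can be illustrated with a token-level sketch. The lookup table repeats the examples from the text; the small set of number words, the helper names, and the fact that every matching token is treated as a preposition (the described system relies on its linguistic analysis for that) are simplifying assumptions.

```python
PREP_TABLE = {"through": "thru", "at": "@", "to": "2", "for": "4"}
DIGIT_SUBSTITUTES = {"2", "4"}
# Illustrative subset of spelled-out numbers that have a numeral substitution.
NUMBER_WORDS = {"one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}


def looks_numeric(token):
    """True for numerals and for spelled-out numbers that could become numerals."""
    return token[:1].isdigit() or token.lower() in NUMBER_WORDS


def shorten_prepositions(tokens):
    out = []
    for i, tok in enumerate(tokens):
        short = PREP_TABLE.get(tok.lower())
        if short is None:
            out.append(tok)
            continue
        if short in DIGIT_SUBSTITUTES:
            prev_tok = tokens[i - 1] if i > 0 else ""
            next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
            # Keep the word when a neighbour would also be rendered as digits.
            if looks_numeric(prev_tok) or looks_numeric(next_tok):
                out.append(tok)
                continue
        out.append(short)
    return out


print(shorten_prepositions("I want to leave".split()))
```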

[0065] Some pronouns are also subject to special handling and are given the ShortType "Pronouns". For English,
the pronoun "you" is replaced by "U". All other pronouns stay the same, with no vowel removal. For Spanish, the pronoun
"Usted" is replaced by "Ud" and "Ustedes" by "Uds". In the German language, the pronouns that include "ein" (plus
inflection) are summarized using the numeral "1".

[0066] Punctuation is specially handled and is given the ShortType "Punctuation". Punctuation that is not a sentence
separator and does not occur inside an email alias or URL is deleted. Essential punctuation is given the ShortType
"EssentialPunct". For all languages, the following characters are not deleted: - : I ^ ? I [] 0 <> = == ". In Japanese,
the special small circle symbol which is used exclusively as a sentence separator is not deleted either. The semicolon
and period are deleted only if they are not sentence-final punctuation. All other characters are marked as
NonessentialPunctuation (described below).

[0067] However, in one embodiment, sequences of final punctuation are reduced to the first character. Therefore, a
phrase such as "Are these things removed?!?" simply has its final punctuation reduced to "?".
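Reducing a run of final punctuation to its first character can be expressed with a single regular expression. This is a sketch only; the set of characters treated as final punctuation (period, question mark, exclamation mark) is an assumption.

```python
import re

# A final-punctuation character followed by one or more others at the end.
FINAL_RUN = re.compile(r"([.?!])[.?!]+$")


def reduce_final_punctuation(sentence):
    """Keep only the first character of a trailing punctuation sequence."""
    return FINAL_RUN.sub(r"\1", sentence)


print(reduce_final_punctuation("Are these things removed?!?"))
```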

[0068] Also, for all languages, punctuation that occurs between items which, under other compression rules, may
be rendered as digits is retained. For example, in the phrase "I bought 3 in 1976 and in 1977, 100" the comma after
1977 is retained (or optionally a space is retained) in order to avoid the compression "1977100" and to instead have the
compression "1977,100" or "1977 100".

[0069] Similarly, in the English language the inches and foot/feet measurement phrases are converted into " or ' as
appropriate.

[0070] Other, non-essential punctuation marks are subject to special handling and are given the ShortType
"NonessentialPunct". Punctuation inside factoids (such as email addresses, URLs, numeric ranges, etc.) is left intact.
Punctuation not inside such factoids can be deleted except for EssentialPunct and punctuation that occurs as a conjunction
(e.g., semicolons to separate clauses).

[0071] A number of verbs are also subject to special handling and are given the ShortType "Verbs". Such verbs are
the subject of dictionary lookups. For example, the word "are" can be replaced by the letter "R", and the word "be" can be
replaced by "B". Otherwise, verbs are simply subjected to character reduction and case normalization as described
below.

[0072] Two other forms of special handling are performed as well. One is given the ShortType "WordSubstitution"
which involves substituting words, and the other is the handling of the "wh" words discussed above. A more detailed
discussion of those types of special handling is given later in the description.

[0073] Discussion now proceeds again with respect to FIGS. 5A and 5B. If none of these special handling cases are
to be undertaken at block 244 in FIG. 5A, then the ShortForm attribute associated with the word under analysis is
simply set to the LongForm attribute (which is the form of the word written in the text). This is indicated by block 246.
[0074] However, if, at block 244, it is determined that special handling is to be done, it is next determined whether
the special handling is word substitution. Word substitution is often simply performed based on a dictionary lookup.
Word substitution can be performed, for example, to obtain an acronym for another word or phrase. For instance, in
the English language the phrase "as soon as possible" can be substituted with "ASAP".

[0075] If the special handling is word substitution, then the necessary word substitution is performed for the word in
the text in order to obtain the ShortForm attribute. This is indicated by block 250. If word substitution is successful,
then the CaseNormalizedForm (CNF) attribute and the CompressedForm (Comp) attribute are both set to the same
form as now found in the ShortForm attribute. This removes the word from further processing such as character
reduction and case normalization. This is indicated by block 252. Therefore, the word substitution process can be used
to avoid other troublesome situations as well. For example, in German the pronoun "sich" can be required (by word
substitution) to remain "sich" in order to avoid later vowel deletion which would result in a common abbreviation for an
obscenity. Determining whether the special handling is word substitution is indicated by block 248.
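The word-substitution branch (blocks 248 to 252) can be pictured as a dictionary lookup that fills three of the five attributes at once. The record type, its field names, and the contents of the table are assumptions made for this sketch; only the attribute names and the example substitutions come from the description.

```python
from dataclasses import dataclass


@dataclass
class CompressionOptions:
    short_type: str       # ShortType
    long_form: str        # LongForm: the text as written
    short_form: str = ""  # ShortForm
    cnf: str = ""         # CaseNormalizedForm
    comp: str = ""        # CompressedForm


# Illustrative dictionary; "sich" maps to itself to shield it from vowel removal.
SUBSTITUTIONS = {
    "as soon as possible": "ASAP",
    "meeting": "Mtg",
    "sich": "sich",
}


def try_word_substitution(long_form):
    """Return a filled record if the phrase has a dictionary substitute, else None."""
    replacement = SUBSTITUTIONS.get(long_form.strip().lower())
    if replacement is None:
        return None
    # ShortForm, CNF and Comp all take the substituted form, so no later
    # case normalization or character reduction is applied.
    return CompressionOptions("WordSubstitution", long_form,
                              replacement, replacement, replacement)


print(try_word_substitution(" as soon as possible"))
```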

[0076] If, at block 248, it is determined that the particular type of special handling to be undertaken is not word
substitution, then it is determined at block 254 whether the special handling to be undertaken is that associated with
the "wh" words mentioned above. If so, recall that the "wh" words are not to be reduced. In that case, all remaining
attributes (ShortForm, CaseNormalizedForm, and CompressedForm) are set to the LongForm. This is indicated by
block 256.

[0077] If, at block 254, it is determined that the special handling to be undertaken is not that associated with the "wh"
words, then it must be one of the other special handling operations discussed above. In that case, the particular special
handling step is performed to obtain the ShortForm attribute and the ShortForm attribute is saved. This is indicated by
block 258.

[0078] Once the special handling has been performed and the ShortForm attribute has been obtained, the ShortForm
attribute is submitted for space removal. It is first determined whether space removal is to be done. This is indicated
by block 260. If so, then the short form is submitted to a space removal algorithm such as that set out in the following
pseudocode.
Classify each token as

<EssentialPunct>: assume these need no delineation, and can serve to delineate all tokens

<CaseDelineable>: includes all normal words/phrases etc. where we can normalize the case

<Number>: numbers (note that these include tokens like "two" that have been converted to "2")

<SpaceDelineable>: tokens that must have a space around them - like URLs and email addresses

One embodiment of the algorithm:

// start off with the short form sans leading spaces
Result = RemoveLeadingSpaces(<short form>)
// only do this if the token is not NULL
if (Result) {
    FrontSpaceNeeded = FALSE;
    // switch on type of current token
    switch <curtype> {
    case <EssentialPunct>:
        // should be all done. No delineation required
        break;
    case <CaseDelineable>:
        // put in a space if prev type was space delineable
        if (prevtype == <SpaceDelineable>) FrontSpaceNeeded = TRUE;
        break;
    case <Number>:
        // put in a space if prev type is number or space delineable
        if (prevtype == <SpaceDelineable> || prevtype == <Number> ||
            PreviousToken ends in a digit) FrontSpaceNeeded = TRUE;
        break;
    case <SpaceDelineable>:
        // put in a space unless previous token was essential punctuation
        if (prevtype != <EssentialPunct> && !IsFirstTokenInSentence)
            FrontSpaceNeeded = TRUE;
        break;
    }
    // set prevtype to current type
    prevtype = curtype;
    if (FrontSpaceNeeded) Result = AddLeadingSpace(<Result>)
}


[0079] The pseudocode indicates that spaces will not be removed preceding URLs, email addresses, etc., nor will
they be removed following those items. However, in other cases, where delineation can be made, spaces will be
removed from the ShortForm attribute. This is indicated by block 262.
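For readers who prefer running code, the space removal pseudocode can be transcribed into Python roughly as follows. This is a sketch rather than the described implementation: the enum, the function name, the use of "not the first emitted token" as a stand-in for IsFirstTokenInSentence, and the classification of "@" as essential punctuation in the usage example are all assumptions.

```python
from enum import Enum, auto


class TokenType(Enum):
    ESSENTIAL_PUNCT = auto()
    CASE_DELINEABLE = auto()
    NUMBER = auto()
    SPACE_DELINEABLE = auto()


def join_tokens(tokens):
    """tokens: iterable of (short_form, TokenType). Returns the joined text."""
    pieces = []
    prev_type = None
    prev_text = ""
    for short_form, cur_type in tokens:
        result = short_form.lstrip(" ")
        if not result:  # NULL token, e.g. a deleted article
            continue
        front_space = False
        if cur_type is TokenType.CASE_DELINEABLE:
            front_space = prev_type is TokenType.SPACE_DELINEABLE
        elif cur_type is TokenType.NUMBER:
            front_space = (
                prev_type in (TokenType.SPACE_DELINEABLE, TokenType.NUMBER)
                or prev_text[-1:].isdigit())
        elif cur_type is TokenType.SPACE_DELINEABLE:
            front_space = (prev_type is not TokenType.ESSENTIAL_PUNCT
                           and bool(pieces))
        pieces.append((" " if front_space else "") + result)
        prev_type, prev_text = cur_type, result
    return "".join(pieces)


T = TokenType
example = [("U", T.CASE_DELINEABLE), (" Hve", T.CASE_DELINEABLE),
           ("", T.CASE_DELINEABLE), (" Mtg", T.CASE_DELINEABLE),
           (" Wth", T.CASE_DELINEABLE), (" Epstein", T.CASE_DELINEABLE),
           (" 12/3", T.NUMBER), (" @", T.ESSENTIAL_PUNCT), (" 10am", T.NUMBER)]
print(join_tokens(example))
```

Case-delineable tokens can be run together because the capital letter that begins each one marks the boundary, which is why only numbers and space-delineable items ever force a space.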

[0080] Next, it is determined whether case normalization is to be performed. This is indicated by block 264. It will be
appreciated, for example, that case normalization may not be desired in URLs and emails and other such items that
are case sensitive. If that is the case, then the CaseNormalizedForm attribute is set to the ShortForm attribute as
indicated by block 266. However, if case normalization is to be performed, then the first letter in each word in the
ShortForm attribute (recall that the token can be composed of multiple words) is capitalized, and that is saved as the
CaseNormalizedForm attribute. This is indicated by block 268.
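A minimal sketch of the case normalization step (blocks 264 to 268). The function name and the `enabled` flag are hypothetical; the sketch capitalizes the first letter of each word and leaves the remaining letters as they are, since the text does not say whether they are lowercased.

```python
def case_normalize(short_form, enabled=True):
    """Capitalize the first letter of each word, or pass the form through."""
    if not enabled:
        # Case-sensitive items (URLs, email addresses) keep their ShortForm.
        return short_form
    words = short_form.split(" ")
    return " ".join(w[:1].upper() + w[1:] for w in words)


print(case_normalize("have"))
print(case_normalize("http://example.com/Path", enabled=False))
```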

[0081] It is next determined whether further compression is to be performed. This is indicated by block 270. For
example, in a number of the special handling cases mentioned above, vowel removal is not to be performed (such as
in pronouns in the English language, the "wh" words, proper names or in the ShortForm of days such as Mon, Tues,
etc.). Similarly, vowels or consonants are not to be removed from acronyms, email addresses, URLs, etc.

[0082] If further compression is not to be performed, then the CompressedForm attribute is set to the
CaseNormalizedForm attribute as indicated in block 272. However, if further compression is to be performed, then the
CaseNormalizedForm is submitted for character reduction (such as the removal of vowels and consonants).

[0083] For the present discussion, the term "medial vowels" will mean a single vowel or a sequence of vowels that
is neither at the beginning nor at the end of a word. In the English language, all medial vowels are removed.
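The English medial vowel rule of paragraph [0083] can be sketched as follows. The function name is hypothetical, and the vowel set is limited to unaccented a, e, i, o, u, with "y" treated as a consonant; both are assumptions of the sketch.

```python
VOWELS = set("aeiouAEIOU")


def remove_medial_vowels(word):
    """Drop vowels that are neither in the leading nor the trailing vowel run."""
    start = 0
    while start < len(word) and word[start] in VOWELS:
        start += 1            # skip the word-initial vowel sequence
    end = len(word)
    while end > start and word[end - 1] in VOWELS:
        end -= 1              # skip the word-final vowel sequence
    middle = "".join(ch for ch in word[start:end] if ch not in VOWELS)
    return word[:start] + middle + word[end:]


print(remove_medial_vowels("Have"), remove_medial_vowels("With"))
```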
[0084] For removing letters in German, consonant cluster simplification rules are first applied. For example, the
consonant cluster "sch" is simplified to "sh" except in the diminutive suffix -schen. Also, the consonant cluster "ck" is
simplified to "k".

[0085] Next, the word-final sequence -ein is replaced with the homophonous -1. Some words in German end in -ein,
but it is not homophonous with the number one. Some examples of such words are the following:

Codein, Coffein, Casein, Fluoreszein, Hussein, Kaffein, Kasein, Kleberprotein, Kodein, Lutein, Movein, Nuklein,
Nuclein, Olein, Phenolphtalein, Phtalein, Protein, Pygmaein, Talein, Tein, Thein, Zein,
Zygstein


[0086] It should also be noted that if the following word is a number, date, time, etc. (such as anything which may
start with a digit), then the "ein" substitution is not performed.
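The word-final -ein rule and its two exceptions can be sketched in a few lines. The function name is hypothetical, the exception set is only an illustrative subset of the words listed in the text, and testing whether the next word begins with a digit is a simplification of "number, date, time, etc."

```python
# Illustrative subset of words whose final -ein is not pronounced like "ein".
EIN_EXCEPTIONS = {"casein", "coffein", "hussein", "kasein", "lutein",
                  "olein", "protein", "thein", "zein"}


def replace_final_ein(word, next_word=""):
    """Replace a word-final 'ein' with '1' unless an exception applies."""
    lowered = word.lower()
    if not lowered.endswith("ein"):
        return word
    if lowered in EIN_EXCEPTIONS:
        return word           # -ein is not homophonous with the number one
    if next_word[:1].isdigit():
        return word           # avoid running into a following number
    return word[:-3] + "1"


print(replace_final_ein("kein"), replace_final_ein("Protein"))
```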

[0087] In German, in words that contain only one medial vowel, the vowel is not deleted. For words with more than
one medial vowel, every second medial vowel is deleted. The letter "u" between a consonant and a word-final "ng" is
deleted. Any cases of "ie" that still remain are converted to "i". Finally, the letter "e" is deleted if it follows a consonant
and precedes a word-final "l, m, n or r". Note that a vowel is not deleted if it follows the letter s and precedes the cluster
ch since this would result in the sequence sch which German readers have a very strong tendency to interpret as the
beginning of a syllable. For the present discussion, vowels typically include aeiou and in some languages y, and all
forms with accents, umlauts, and other diacritics. A list sufficient for English, German, French and Spanish is:



«a^^aaeeee3iiiXceo6a6uauQ£AAAAA££:£:£ClXtIl(£0606uUOO 
[0088] For English, German, French and Spanish, consonants include:

qwrtypsdfghjklzxcgvbnftoQWRTYPSDFGHJKLZXCCVBNfiMfi 



although additional consonant symbols may be added for other languages. 

[0089] Once character reduction (such as vowel and consonant removal) is performed, as indicated by block 274,
the CompressedForm attribute is obtained and saved. This is indicated by block 276. Finally, all five attributes can be
output as potential compression options. This is indicated by block 278.

[0090] It should also be noted that during traversal of the syntax parse tree, compression can be performed on a
non-terminal node level as well. In one embodiment, entire phrases are deleted based on the syntactic analysis. For
example, consider the sentence "While I was stuck on the freeway, I remembered to ask you to send me the contact
information for Dr. Mary Smith". In this example, the entire sentence-initial subordinate clause can be deleted. In other
words, the syntactic analysis indicates that it is subordinate and the subordinating conjunction "while" indicates that
this is a temporal adverbial clause. Therefore, this entire phrase can simply be deleted to obtain the sentence "I
remembered to ask you to send me the contact information for Dr. Mary Smith." The patent application Serial No.
09/220,836, entitled SYSTEM FOR IMPROVING THE PERFORMANCE OF INFORMATION IDENTIFYING CLAUSES
HAVING PREDETERMINED CHARACTERISTICS, filed on December 24, 1998, provides additional information
regarding the identification of subordinate clauses and whether those clauses contain relatively important material.
[0091] Another example of compressing at the non-terminal node level is with respect to speech act verbs. Speech
act verbs are a subclass of what linguists refer to as "complement taking predicates." In the English language, an
ambiguity is illustrated in the following sentence:

[0092] "John said that he was arriving next Wednesday."

[0093] In one reading, the word "he" is co-referential with "John". In another reading, "he" could be someone else.
Some elements of this sentence can be deleted without making the output any more or less ambiguous than the input,
as follows:
[0094] If the subject of the matrix clause speech act verb (in this case "John", the subject of "said") is possibly
co-referential with a pronominal subject of the subordinate clause (he), and this can be determined either by noting that
they are both masculine, as we know from a morphology lookup, or by using more sophisticated semantic analysis to
determine co-reference, then the pronoun in the subordinate clause can be deleted. Note that the subordinating
conjunction "that" can also be deleted, to yield:
"John said was arriving next Wednesday".

[0095] It should be noted that care must be taken to only delete the subject of the subordinate clause when it is a
pronoun, and possibly co-referential with the subject of the main clause. For example, it should not be deleted in the
following cases:

John said that she was arriving...
John said that Bill was arriving...
John said that they were arriving...

[0096] At this point, following through with the example of the sentence illustrated in FIG. 3 may be helpful. As stated
earlier, each node in the analysis is iteratively examined to determine whether compression can be accomplished.
Therefore, the sentence node (S) is first examined. No compression can be done at this point, so processing proceeds
deeper in the analysis and the noun phrase node 300 is examined. No compression can be performed at that level so
processing continues deeper to the pronoun node 302. It is seen that the pronoun is "you". Therefore, under the special
handling provisions, this can be converted to the term "U". This results in the following attributes:



ShortType = Pronouns
LongForm = You
ShortForm = U
CNF = U
Comp. = U



[0097] Next, processing continues with respect to verb phrase node 304. It is seen that no compression can be
performed at this level so the verb node 306 is examined. The term "have" is simply passed through the flow chart
illustrated in FIGS. 5A and 5B and subjected to case normalization and vowel removal to obtain the term "Hve". This
results in the attributes as follows (wherein the underscore represents a leading space):

ShortType = VerbsDefault
LongForm = _have
ShortForm = _have
CNF = Have
Comp. = Hve

[0098] Again, examination of the node 308 is done and it is found that no compression can be done at this level.
Therefore, examination proceeds to node 310 where the article "a" is deleted at block 240 in FIG. 5A to yield:

ShortType = Articles
LongForm = _a
ShortForm = Null
CNF = Null
Comp. = Null



[0099] The node 312 is then examined, and is subjected to word substitution to result in the five attributes as follows:

ShortType = WordSubstitution
LongForm = _meeting
ShortForm = Mtg
CNF = Mtg
Comp. = Mtg



[0100] The prepositional phrase node 314 is then examined and it is determined that no compression can be done
at that level. Therefore, the preposition node 316 is examined. Processing moves through the flow chart in FIGS. 5A
and 5B and case normalization and vowel removal are conducted to yield the five attributes as follows:

ShortType = PrepsDefault
LongForm = _with
ShortForm = _with
CNF = With
Comp. = Wth

[0101] The proper noun node 318 is then examined. It is found, at this node, that the three words "Dr. John Epstein" can
be compressed using the ShortType PrprN. This yields the five attributes as follows:

ShortType = PrprN
LongForm = _Dr._John_Epstein
ShortForm = _Epstein
CNF = Epstein
Comp. = Epstein

[0102] Next, node 320 is examined and it is found that this phrase represents an offset date. This is analyzed, through
the flow diagram illustrated in FIGS. 5A and 5B, to yield the following five attributes:

ShortType = OffsetDate
LongForm = _next_Tuesday
ShortForm = _12/3
CNF = 12/3
Comp. = 12/3

[0103] Next, node 322 is examined and it is determined that no compression can be made at that node. Therefore,
the preposition node 324 is examined. It is noted, through processing as indicated in FIGS. 5A and 5B, that the term
"at" is the subject of a word substitution for "@". This yields the five attributes as follows:

ShortType = WordSubstitution
LongForm = _at
ShortForm = @
CNF = @
Comp. = @

[0104] Finally, the node 326 is examined and the only compression that is found is to replace the spelled-out term
"ten" with the number "10" to yield the five attributes:

ShortType = Numbers
LongForm = _ten_am
ShortForm = _10am
CNF = 10am
Comp. = 10am

[0105] The compressor 202 is then free to pick and choose among the various compression options illustrated in
these data structures to provide a final output compressed version of the text. This can be done very aggressively, as
in the case of the display screen on the target device 204 with a very limited size, or it can be done less aggressively,
as in the case of a palm top computer with more display space, for instance. Therefore, for example, the most aggressive
compression is as follows:

UHveMtgWthEpstein12/3@10am
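How a downstream compressor might choose among the options is not specified in detail, so the following is only one possible sketch. It assumes a single level is applied to the whole message, tries the levels from least to most aggressive until the result fits, and skips the ShortForm level for simplicity. The function name and the `max_chars` parameter are hypothetical; the attribute values are those of the worked example, with leading spaces written literally.

```python
# (LongForm, CNF, Comp) for each node of the example sentence.
OPTIONS = [
    ("You", "U", "U"),
    (" have", "Have", "Hve"),
    (" a", "", ""),                 # deleted article
    (" meeting", "Mtg", "Mtg"),
    (" with", "With", "Wth"),
    (" Dr. John Epstein", "Epstein", "Epstein"),
    (" next Tuesday", "12/3", "12/3"),
    (" at", "@", "@"),
    (" ten am", "10am", "10am"),
]


def compress_to_fit(options, max_chars):
    """Return the least aggressive rendering that fits in max_chars."""
    candidate = ""
    for level in range(3):          # 0 = LongForm, 1 = CNF, 2 = Comp
        candidate = "".join(opt[level] for opt in options).strip()
        if len(candidate) <= max_chars:
            return candidate
    return candidate                # most aggressive form, even if too long


print(compress_to_fit(OPTIONS, 80))
print(compress_to_fit(OPTIONS, 26))
```

A per-token choice, mixing levels within one message, would also be consistent with the text; the uniform-level loop is used here only because it is the simplest thing to show.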

[0106] Even with very aggressive compression, this is a highly readable and decipherable text message, yet it saves
a great deal of space over the original set out in FIG. 3.

[0107] Thus, it can be seen that the present invention can be used to provide significant compression, yet the
compression is made in a highly linguistically intelligent fashion such that it can be easily deciphered by a human. It also
provides a plurality of different compression options for individual words and phrases, which, in most cases, reflect
various degrees of aggressiveness. This is tremendously helpful to the downstream components which eventually
must choose the best compression sequence in the target device.

[0108] Although the present invention has been described with reference to particular embodiments, workers skilled
in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of
the invention.
Claims

1. A method of processing a body of text to generate compression options, comprising:

performing a linguistic analysis on the body of text to obtain a linguistic output indicative of linguistic
components of the body of text; and

generating a plurality of compression options to compress the body of text based on the linguistic output.

2. The method of claim 1 wherein generating a plurality of compression options comprises: 

subjecting a portion of the body of text to different sets of compression rules to obtain the plurality of compres- 
sion options. 

3. The method of claim 2 wherein subjecting the body of text to different sets of compression rules comprises:

subjecting the portion of the body of text to the different sets of compression rules in a predetermined order
such that the compression options reflect varying degrees of compression of a same portion of the body of text.

4. The method of claim 3 wherein generating a plurality of compression options comprises:

generating a compression identifier attribute indicative of at least one of the sets of compression rules to which
the portion of the body of text is subjected.

5. The method of claim 4 wherein generating a plurality of compression options comprises:

generating a ShortForm attribute indicative of a compressed form of the portion of the body of text after
application of the set of compression rules.

6. The method of claim 5 wherein generating a plurality of compression options comprises:

generating a case normalized attribute, based on the ShortForm attribute, indicative of a CaseNormalizedForm
of the ShortForm attribute.

7. The method of claim 6 wherein generating a plurality of compression options comprises:

generating a compression attribute indicative of a further compressed form of the case normalized attribute.

8. The method of claim 7 wherein generating a compression attribute comprises:

applying letter removal rules to the case normalized attribute to remove letters based on a predetermined
location of the letters in the CaseNormalizedForm.

9. The method of claim 8 wherein generating a plurality of compression options comprises:

generating a LongForm attribute that reflects substantially no compression of the portion of the body of text.

10. The method of claim 9 wherein one ShortForm attribute comprises a word substitution based on a dictionary
lookup and wherein generating a plurality of compression options comprises:

setting the case normalized attribute and the compression attribute to the ShortForm attribute.

11. The method of claim 5 wherein performing a linguistic analysis comprises performing a syntactic analysis on the
portion of the body of text and wherein generating the ShortForm attribute comprises:

applying the set of compression rules based on the syntactic analysis.

12. The method of claim 11 wherein the linguistic analysis further comprises, prior to performing the syntactic analysis:

performing a lexical analysis on the body of text; and
performing a morphological analysis on the body of text.

13. The method of claim 5 wherein generating the ShortForm attribute comprises:

normalizing dates to a numerical form.

14. The method of claim 5 wherein generating the ShortForm attribute comprises:

normalizing offset dates to a numerical form, based on a date that the body of text was authored.

15. The method of claim 5 wherein generating the ShortForm attribute comprises:

maintaining symbol-sensitive text fragments in uncompressed form.

16. The method of claim 15 wherein maintaining symbol-sensitive text fragments comprises:

maintaining text fragments that cannot be accurately understood unless maintained fully intact, in
uncompressed form.

17. The method of claim 16 wherein maintaining text fragments comprises:

maintaining uniform resource locators and electronic mail addresses in uncompressed form.

18. The method of claim 11 wherein the syntactic analysis includes a tree having non-terminal nodes representing
multi-word portions of the body of text and terminal nodes indicative of words in the body of text, and wherein both
the non-terminal nodes and the terminal nodes are examined for application of compression rules.

19. A data structure formed from an analysis of a portion of a body of text indicative of a plurality of compressed forms
of the portion of the body of text, the data structure comprising:

a plurality of data fields, representing a plurality of compressed forms of the portion of the body of text.

20. A message handler receiving a message and generating compression options indicative of different forms of a portion
of a body of text in the message, the message handler comprising:

a linguistic analyzer linguistically configured to analyze the body of text and provide a linguistic analysis; and
a compression form generator configured to generate a plurality of compressed forms of a portion of the body
of text based on the linguistic analysis.
You have a meeting with Dr. John Epstein next Tuesday at ten am

FIG. 3